Below we will practice finding best fit GLMs on two different datasets. The first data set is from my thesis work and includes the total length (mm) and sex of fin clipped Kentucky Arrow Darters. The second dataset is from Dryad and focuses on the chemistry of acid mine drainage in freshwater ecosystems of New Zealand.

Kentucky Arrow Darter Data

Our first GLM practice run will be on the Kentucky Arrow Darter data set, comparing fish sex to total length (mm).

A male Kentucky Arrow Darter that is approximately 75 mm in length (Left) and a female Kentucky Arrow Darter that is approximately 65mm in length (Right) Photos provided by Matt Thomas KFWS.

First let’s upload the dataset.

kad_tissue_data <- read.csv("~/Desktop/Analytics/assignment1glm/GLM-Practice/data/kad_tl_data.csv")

Convert categorical Male and Female data to binary data (0 and 1)

kad_tissue_data$sex_binary <- factor(kad_tissue_data$sex, levels=c("M", "F"), labels = c(0,1))

Now change that factor data to interger

str(kad_tissue_data)
## 'data.frame':    107 obs. of  4 variables:
##  $ tissue_id : chr  "APSU-KAD-500" "APSU-KAD-501" "APSU-KAD-502" "APSU-KAD-503" ...
##  $ tl_mm     : int  65 85 78 91 71 63 71 65 70 72 ...
##  $ sex       : chr  "F" "M" "F" "M" ...
##  $ sex_binary: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 2 2 2 1 ...
kad_tissue_data$num_sex_binary<-as.numeric(as.character(kad_tissue_data$sex_binary))

Let’s plot the data on a box and whisker plot to see if there are any obvious trends

ggplot(kad_tissue_data, aes(sex,tl_mm)) +
  geom_boxplot(color="blue", fill="green4", alpha=0.2) +
  geom_jitter(height = 0, width = .15)+
  xlab ("Sex") +
  ylab ("Total Length (mm)") +
  labs(title="Influence of Kentucky Arrow Darter Sex on Total Length (mm)")

As we can see, males had a higher mean total length than females. This suggests that males may be typically bigger than females.

We need to now find a proper GLM for our dataset

Poisson

pois1 <- glm(tl_mm~num_sex_binary, family= poisson(link=log), data= kad_tissue_data)
summary(pois1)
## 
## Call:
## glm(formula = tl_mm ~ num_sex_binary, family = poisson(link = log), 
##     data = kad_tissue_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.3298  -0.7556  -0.0348   1.0499   2.3012  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     4.30811    0.01641 262.583  < 2e-16 ***
## num_sex_binary -0.10028    0.02303  -4.355 1.33e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 184.74  on 106  degrees of freedom
## Residual deviance: 165.79  on 105  degrees of freedom
## AIC: 820.72
## 
## Number of Fisher Scoring iterations: 4

the degrees of freedom and residual deviance are not similar, suggesting this model is not best fit

Let’s try another GLM

Binomial Distribution

Let’s check to see if binomial distribution fits

binom1 <- glm(sex_binary~tl_mm, family= binomial, data= kad_tissue_data)
summary(binom1)
## 
## Call:
## glm(formula = sex_binary ~ tl_mm, family = binomial, data = kad_tissue_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1275  -1.1139   0.7029   1.0436   1.5436  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  4.93252    1.51999   3.245  0.00117 **
## tl_mm       -0.06778    0.02118  -3.201  0.00137 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 147.88  on 106  degrees of freedom
## Residual deviance: 135.83  on 105  degrees of freedom
## AIC: 139.83
## 
## Number of Fisher Scoring iterations: 4

So it’s not a perfect fit, but better than poisson.

This is an autoplot to help us understand if it really fits the model

kad_tissue_data$tl_mm100 <- kad_tissue_data$tl_mm/100
fit.1 <- glm(num_sex_binary~tl_mm100, data=kad_tissue_data, binomial(link="logit"))
autoplot(fit.1)

Some of these don’t look great, but it is not awful, so we are going to take it.

Let’s try a binned plot

x <- predict(fit.1)
y <- resid(fit.1)
binnedplot(x, y)

Final binomial plot

ggplot(kad_tissue_data, aes(tl_mm,num_sex_binary)) +
  geom_point(color="green4") +
  geom_smooth(method="glm", method.args=list(family="binomial"(link="logit"))) + 
  xlab ("Total Length (mm)") +
  ylab ("Probability of being Male (0) or Female (1)")
## `geom_smooth()` using formula 'y ~ x'

Binomial GLM is the best fit for our Kentucky Arrow data set examining the relationship between sex and total length (mm).

Now it’s time to move onto the next data set!

New Zealand acid mine drainage water chemistry


Acid mine drainage contaminating various ecosystems across the planet.

First, let’s import the dataset

water_chemistry <- read.csv("~/Desktop/Analytics/assignment1glm/GLM-Practice/data/water_chemistry.csv")

Now let’s do a raw plot to get the general idea of the data trends

ggplot(water_chemistry,aes(Fe,Cond)) +
  geom_point(color="green4") +
  geom_smooth(color="blue") +
  xlab ("Iron") +
  ylab ("Conductivity") +
  labs(title="Raw Fit: How does iron effect conductivity?")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Let’s try to do a Poisson summary to see if that is a fitting model

pois2 <- glm(Fe~Cond, family= poisson(link=log), data= water_chemistry)
summary(pois2)
## 
## Call:
## glm(formula = Fe ~ Cond, family = poisson(link = log), data = water_chemistry)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -66.637  -42.018  -29.247    0.295  162.800  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 7.381e+00  5.146e-03  1434.3   <2e-16 ***
## Cond        1.408e-03  3.074e-06   457.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287228  on 24  degrees of freedom
## Residual deviance:  70982  on 23  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 5

The residual deviance and degrees freedom have a large difference, so this is not a good model

Let’s try Gaussian

gaus2 <- glm(Fe~Cond, family= gaussian, data= water_chemistry)
summary(gaus2)
## 
## Call:
## glm(formula = Fe ~ Cond, family = gaussian, data = water_chemistry)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5910.3  -1993.2   -107.3    466.3  10668.0  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -452.219   1042.662  -0.434    0.669    
## Cond          12.915      1.251  10.326 4.16e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 17311224)
## 
##     Null deviance: 2244001448  on 24  degrees of freedom
## Residual deviance:  398158141  on 23  degrees of freedom
## AIC: 491.53
## 
## Number of Fisher Scoring iterations: 2

It somehow got worse (there is even a larger difference between the residual deviance and DF)

Let’s plot Gaussian

ggplot(water_chemistry, aes(Fe,Cond)) +
  geom_point() +
  geom_smooth(method="glm", method.args=list(family="gaussian")) + 
  xlab ("Total Length (mm)") +
  ylab ("Probability of being Male (0) or Female (1)")
## `geom_smooth()` using formula 'y ~ x'

The plot for Gaussian does not look awful, but I think we can do better

One more try, this time Gamma GLM

gam <- glm(Fe~Cond, family= Gamma(link="sqrt"), data= water_chemistry)
summary(gam)
## 
## Call:
## glm(formula = Fe ~ Cond, family = Gamma(link = "sqrt"), data = water_chemistry)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4384  -0.5768  -0.2562   0.3429   0.9961  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.14515    2.01703   8.004 4.25e-08 ***
## Cond         0.09333    0.01188   7.858 5.82e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.4210324)
## 
##     Null deviance: 71.737  on 24  degrees of freedom
## Residual deviance: 10.821  on 23  degrees of freedom
## AIC: 421.33
## 
## Number of Fisher Scoring iterations: 6

Since the difference between residual deviance and DF is at the lowest we have seen for this dataset, this suggests that Gamma square root is the best fit model

last, but not least, let’s plot the Gamma square root GLM

ggplot(water_chemistry, aes(Fe,Cond)) +
  geom_point() +
  geom_smooth(method="glm", method.args=list(family="Gamma"(link ="sqrt"))) + 
  xlab ("Total Length (mm)") +
  ylab ("Probability of being Male (0) or Female (1)")
## `geom_smooth()` using formula 'y ~ x'

Looks funky, but it’ll do?